UCT/ROCM/COPY: Use faster memcpy for device to host copies #4532
Can one of the admins verify this patch?
ok to test
Mellanox CI: FAILED on 3 of 25 workers. Note: the logs will be deleted after 10-Dec-2019.
ce7608e to 2bf407f
@souravzzz Do you happen to have performance numbers that support this optimization?
Mellanox CI: PASSED on 25 workers. Note: the logs will be deleted after 10-Dec-2019.
Hi @shamisp, here are some intra-node D2D numbers with the rocm_copy transport that show the improvements from this change.
Mellanox CI: PASSED on 25 workers. Note: the logs will be deleted after 10-Dec-2019.
Looks impressive. What about D2H? Thanks.
@shamisp We see good improvement for D2H transfers as well.
Looks good to me. @yosefe?
2bf407f to 80bbc67
Thanks for the feedback, @yosefe. I have incorporated the suggested changes.
Mellanox CI: PASSED on 25 workers. Note: the logs will be deleted after 11-Dec-2019.
@souravzzz Curious to know whether this test has the GPU touch the transferred buffers before each transfer.
@Akshay-Venkatesh No, these are results from the standard osu_latency benchmark.
What
Use faster memcpy for device to host copies
Why
Native memcpy is slow for device to host copies
How
Use non-temporal load (MOVNTDQA) to implement memcpy
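For readers unfamiliar with the technique, here is a minimal sketch of a memcpy built on non-temporal loads. It is not the code from this PR: the names `copy_streaming_load` and `copy_streaming_load_sse41` are hypothetical, and the sketch assumes GCC or Clang (for the `target` attribute and `__builtin_cpu_supports`). MOVNTDQA is reached through the SSE4.1 intrinsic `_mm_stream_load_si128`, which requires a 16-byte aligned source, so the unaligned head and the short tail go through plain `memcpy`.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__x86_64__) || defined(__i386__)
#include <smmintrin.h> /* SSE4.1: _mm_stream_load_si128 maps to MOVNTDQA */
#define HAVE_STREAMING_LOAD 1
#endif

#ifdef HAVE_STREAMING_LOAD
/* Hypothetical helper: copies len bytes using 16-byte non-temporal loads. */
__attribute__((target("sse4.1")))
static void copy_streaming_load_sse41(void *dst, const void *src, size_t len)
{
    uint8_t *d       = (uint8_t *)dst;
    const uint8_t *s = (const uint8_t *)src;

    /* MOVNTDQA faults on an unaligned source, so copy the bytes up to the
     * next 16-byte boundary with a regular memcpy first. */
    size_t head = (16 - ((uintptr_t)s & 15)) & 15;
    if (head > len) {
        head = len;
    }
    memcpy(d, s, head);
    d   += head;
    s   += head;
    len -= head;

    /* Main loop: aligned non-temporal load, unaligned regular store. */
    while (len >= 16) {
        __m128i v = _mm_stream_load_si128((__m128i *)s);
        _mm_storeu_si128((__m128i *)d, v);
        d   += 16;
        s   += 16;
        len -= 16;
    }

    /* Remaining tail of fewer than 16 bytes. */
    memcpy(d, s, len);
}
#endif

/* Hypothetical entry point: uses the streaming path when the CPU supports
 * SSE4.1 and falls back to the native memcpy otherwise. */
void copy_streaming_load(void *dst, const void *src, size_t len)
{
#ifdef HAVE_STREAMING_LOAD
    if (__builtin_cpu_supports("sse4.1")) {
        copy_streaming_load_sse41(dst, src, len);
        return;
    }
#endif
    memcpy(dst, src, len);
}
```

The non-temporal hint only takes effect when the source is write-combining memory; on ordinary cacheable memory the instruction behaves like a regular aligned load. Any gain therefore depends on how the device buffer is mapped into the host address space.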